[CORE-8485] Reset translation state on snapshot #24522
Conversation
Signed-off-by: Michał Maślanka <michal@redpanda.com>
When an STM receives a Raft snapshot, it indicates that the whole in-memory state of that state machine should be replaced with the state from the snapshot. The datalake translation state machine was handling Raft snapshots incorrectly, which led to its state being out of date after a snapshot was applied. The Raft snapshot for translation_stm is empty, so the correct action is to reset the state machine state and wait for the next update to be applied. Fixes: CORE-8485 Signed-off-by: Michał Maślanka <michal@redpanda.com>
```cpp
// state machine will not hold any obsolete state that should be overridden
// with the snapshot.
vlog(_log.debug, "Applying raft snapshot, resetting state");
_highest_translated_offset = kafka::offset{};
```
It's a bit unclear to me why it is okay to throw out the offset like this: since the raft snapshot is empty, STMs on different replicas will necessarily get out of sync. Is it because, if the translation is to be continued from some later point, we hope to get another update in the log?
Another update is one thing; the other is the reconciliation with the datalake coordinator, which happens before every translation. The empty snapshot indicates the snapshot is not required by this STM, hence resetting the state here is the only viable option we have.
The empty snapshot indicates the snapshot is not required by this STM, hence resetting the state here is the only viable option we have
Yes, but why is it okay to have an empty snapshot?
I was wondering about that, and given that we always commit to the coordinator, I think it is safe. Am I right @bharathv?
I think it's OK to reset. Currently this offset is only used to enforce max_collectible_offset on the replica. It is OK to reset because lowering the max collectible offset only delays compaction and has no correctness implications until it catches up again. As Michal said, the leader reconciles with the coordinator every time to get the offset_to_translate_from.
@WillemKauf is planning to get rid of translation in this path for read replicas, so we could probably just store the log offset and avoid this kafka offset altogether. This kafka offset was added as an optimization so the coordinator can avoid reconciliation in every round of translation, but since then that optimization has been removed to simplify the code. Alternatively, we could just store a pair of <kafka_offset, log_offset>, since it is already in serde, and implement the optimization later.
Currently this offset is only used to enforce max_collectible_offset on the replica
Is there a reason to be concerned that the scope could increase, this assumption no longer holds, and then there is a problem?
Hmm, I can't think of a scope increase for the offset in the near future. The only reason it exists is to enforce max_collectible_offset. Also, as noted, the plan is to get rid of offset translation in this path altogether and make this translation state self-contained, which automatically fixes this problem too.
Ahh right, makes sense. Thanks!
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/59592#0193b5f5-a10e-4615-a139-eb2123b0e78e
/backport v24.3.x
Failed to create a backport PR to v24.3.x branch. I tried: